Abstract

Partially Observable Markov Decision Process models
(POMDPs) have been applied to low-level robot control. We
show how to use POMDPs differently, namely for sensor
planning in the context of behavior-based robot systems. This
is possible because solutions of POMDPs can be expressed as
policy graphs, which are similar to the finite state automata
that behavior-based systems use to sequence their behaviors.

An advantage of our system over previous POMDP navigation
systems is that it is able to find close-to-optimal plans
because it plans at a higher level and thus with smaller state
spaces. An advantage of our system over behavior-based systems
that must be programmed by their users is that it can
optimize plans during missions and thus deal robustly with
probabilistic models that are initially inaccurate.

Introduction 

Mobile robots have to deal with various kinds of uncertainty,
such as noisy actuators, noisy sensors, and uncertainty about
the environment. Behavior-based robot systems, such as
MissionLab (Endo et al. 2000), can operate robustly in
the presence of uncertainty (Arkin 1998). The operation of
MissionLab is controlled by plans in the form of finite state
automata, whose states correspond to behaviors and whose arcs
correspond to observations. These finite state automata have to
be programmed by the users of the system at the beginning of a
mission. However, plans generated by humans are rarely optimal
because they involve complex tradeoffs. Consider, for
example, a simple sensor-planning task, where a robot has
to decide how often to sense before it starts to act. Since the
sensors of the robot are noisy, it may have to sense multiple
times. On the other hand, sensing takes time. How often the
robot should sense depends on the amount of sensor noise,
the cost of sensing, and the consequences of acting based on
wrong sensor information.

In this paper, we develop a robot architecture that
uses Partially Observable Markov Decision Process models
(POMDPs) (Sondik 1978) for planning and combines them
with MissionLab. POMDPs provide an elegant and theoretically
grounded way for probabilistic planning (Cassandra,
Kaelbling, & Littman 1994). So far, they have been used
mainly to solve low-level planning tasks for mobile robots
such as path following and localization (Fox, Burgard, &
Thrun 1998; Mahadevan, Theocharous, & Khaleeli 1998;
Cassandra, Kaelbling, & Kurien 1996; Simmons & Koenig
1995). In this paper, we show that POMDPs can also be
used to solve higher-level planning tasks for mobile robots.
The key idea behind our robot architecture is that POMDP
planners can generate policy graphs rather than the more
popular value surfaces. Policy graphs are similar to the finite
state automata of MissionLab. An advantage of our
robot architecture is that it uses POMDPs in small state
spaces. When POMDPs are used for low-level planning, the
state spaces are often large and finding optimal or
close-to-optimal solutions of POMDPs becomes extremely
time-consuming (Papadimitriou & Tsitsiklis 1987). Thus,
existing robot systems have so far only been able to use greedy
POMDP planning methods that produce extremely suboptimal
plans (Koenig & Simmons 1998). Our robot architecture, on the
other hand, is able to find close-to-optimal plans.

In the following, we first give an example of sensor planning
and then give overviews of behavior-based robotics and
POMDPs using this example. Next, we describe how our
robot architecture combines these ideas by transforming the
output of the POMDP planner (policy graphs) to the input
of MissionLab (finite state automata). Finally, we report on
two experiments that show that the ability to optimize plans
during missions is important because the resulting system is
able to deal robustly with probabilistic models that are
initially inaccurate.

Example:  Sensor  Planning 

We use the following sensor-planning example throughout
this paper, which is similar to an example used in (Cassandra,
Kaelbling, & Littman 1994). Assume that a police robot
attempts to find wounded hostages in a building. When it is
at a doorway, it has to decide whether to search the room.
The robot can either use its microphone to listen for
terrorists (OBSERVE); enter the room, look around, leave the
room, and proceed to the next doorway (ENTER ROOM
AND PROCEED); or move to the next doorway right away
(PROCEED). The cost of OBSERVE is always 5, and the
cost of PROCEED is always 50. Each room is occupied by
terrorists with probability 0.5. OBSERVE reports either that
the room is occupied by terrorists (OBSERVE OCCUPIED)
or not (OBSERVE EMPTY). Although the microphone always
detects the absence of terrorists, it fails to detect the
presence of terrorists with probability 0.2. Multiple
observations are drawn independently from this probability
distribution, which is not completely unrealistic for sound. The
robot gets a reward for entering a room, as an incentive to
find wounded hostages. However, it also gets a penalty if
the room is occupied by terrorists since terrorists might
destroy it. If the room is not occupied by terrorists (ROOM
EMPTY), then ENTER ROOM AND PROCEED results in
a reward of 100. However, if the room is occupied by
terrorists (ROOM OCCUPIED), then ENTER ROOM AND
PROCEED results in a penalty of 500. The main decision that the
robot has to make is how often to OBSERVE and, depending
on the sensor observations, whether to PROCEED to the
next doorway right away or to first ENTER the ROOM AND
then PROCEED.
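The tradeoff in this example can be made concrete with a small Bayes-rule sketch. It uses only the numbers stated above (prior 0.5, miss probability 0.2, reward 100, penalty 500); the function and parameter names are illustrative and not part of the original system. Because the microphone never reports OBSERVE OCCUPIED for an empty room, a single OBSERVE OCCUPIED report settles the question, so only runs of OBSERVE EMPTY reports need to be considered.

```python
def posterior_occupied(n_empty, prior=0.5, miss=0.2):
    """P(room occupied | n_empty consecutive OBSERVE EMPTY reports)."""
    # Joint probability: room occupied and the microphone missed every time.
    occupied = prior * miss ** n_empty
    # Joint probability: room empty (an empty room always reports EMPTY).
    empty = 1.0 - prior
    return occupied / (occupied + empty)


def expected_enter_reward(p_occupied, r_empty=100.0, r_occupied=-500.0):
    """Expected immediate reward of ENTER ROOM AND PROCEED under a belief."""
    return (1.0 - p_occupied) * r_empty + p_occupied * r_occupied


if __name__ == "__main__":
    for n in range(4):
        p = posterior_occupied(n)
        print(n, round(p, 4), round(expected_enter_reward(p), 2))
```

Under these assumptions the probability of an occupied room drops from 1/2 to 1/6, 1/26, and 1/126 after one, two, and three OBSERVE EMPTY reports, while each extra observation costs 5.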

Behavior-Based  Robotics 

Behavior-based robotics uses a tight coupling between sensing
and acting to operate robustly in the presence of uncertainty.
The robot always executes a behavior such as "move
to the doorway" or "enter the room." To sequence these
behaviors, behavior-based robotics often uses finite state
automata whose states correspond to behaviors and whose arcs
correspond to triggers (observations). The current state
dictates the behavior of the robot. When an observation is
made and there is an edge labeled with this observation that
leaves the current state, the current state changes to the state
pointed to by the edge. Since the finite state automata are
based on behaviors and triggers, the robot does not require
a model of the world or complete information about the
current state of the world. For example, a robot does not need
to know the number of doorways or the distances between
them.
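The sequencing rule described above can be sketched in a few lines. This is a minimal illustration, not MissionLab code: the class name, the behavior "leave the room", and the trigger AT DOORWAY are assumptions made for the example. The point it shows is that a trigger without a matching outgoing edge leaves the current behavior unchanged.

```python
class FiniteStateAutomaton:
    """States are behaviors; arcs are labeled with triggers (observations)."""

    def __init__(self, start, arcs):
        self.state = start
        # arcs maps (current behavior, trigger) -> next behavior
        self.arcs = arcs

    def on_trigger(self, trigger):
        # Follow the edge if one is labeled with this trigger;
        # otherwise keep executing the current behavior.
        self.state = self.arcs.get((self.state, trigger), self.state)
        return self.state


# Hypothetical arcs for illustration only.
EXAMPLE_ARCS = {
    ("move to the doorway", "AT DOORWAY"): "enter the room",
    ("enter the room", "IN ROOM"): "leave the room",
    ("leave the room", "IN HALLWAY"): "move to the doorway",
}

if __name__ == "__main__":
    fsa = FiniteStateAutomaton("move to the doorway", EXAMPLE_ARCS)
    for trigger in ["IN ROOM", "AT DOORWAY", "IN ROOM", "IN HALLWAY"]:
        print(trigger, "->", fsa.on_trigger(trigger))
```

Note that the automaton stores no map of the building: it needs neither the number of doorways nor the distances between them.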

We use a robot system based on MissionLab (Endo et al.
2000). MissionLab provides a large number of behaviors
and triggers with which users can build finite state automata
that can then be executed on a variety of robots or in
simulation. The finite state automata have to be programmed by
the users of the system at the beginning of a mission. This
has the disadvantage that MissionLab cannot optimize the
finite state automata during the mission, for example, when it
learns more accurate probabilities or when the environment
changes. Furthermore, humans often assume that sensors are
accurate. Their plans are therefore often suboptimal. We
address this issue by developing a robot architecture that
uses a POMDP planner to generate plans based on
probabilistic models of the world.


States:        S = {s1, s2}
               s1: Room Empty
               s2: Room Occupied

Observations:  O = {o1, o2}
               o1: Observe Empty
               o2: Observe Occupied

Actions:       A(s1) = A(s2) = {a1, a2, a3}
               a1: Enter Room and Proceed
               a2: Observe
               a3: Proceed

Initial state distribution:
               π(s1) = 0.5
               π(s2) = 0.5

Transition probabilities:
               p(s1|s1, a1) = 0.5    p(s1|s2, a1) = 0.5
               p(s1|s1, a2) = 1.0    p(s1|s2, a2) = 0.0
               p(s1|s1, a3) = 0.5    p(s1|s2, a3) = 0.5

Observation probabilities:
               q(o1|s1, a1) = 0.5    q(o1|s2, a1) = 0.5
               q(o1|s1, a2) = 1.0    q(o1|s2, a2) = 0.2
               q(o1|s1, a3) = 0.5    q(o1|s2, a3) = 0.5

Figure 1: POMDP


POMDPs

POMDPs consist of a finite set of states $S$, a finite set of
observations $O$, and an initial state distribution $\pi$. Each
state $s \in S$ has a finite set of actions $A(s)$ that can be
executed in it. The POMDP further consists of a transition
function $p$, where $p(s'|s,a)$ denotes the probability with
which the system transitions from state $s$ to state $s'$ when
action $a$ is executed, an observation function $q$, where
$q(o|s,a)$ denotes the probability of making observation $o$
when action $a$ is executed in state $s$, and a reward function
$r$, where $r(s,a)$ denotes the finite reward (negative cost)
that results when action $a$ is executed in state $s$. A POMDP
process is a stream of <state, observation, action, reward>
quadruples. The POMDP process is always in exactly one state
and makes state transitions at discrete time steps. The initial
state of the POMDP process is drawn according to the
probabilities $\pi(s)$. Thus, $p(s_t = s) = \pi(s)$ for $t = 1$.
Assume that at time $t$, the POMDP process is in state
$s_t \in S$. Then, a decision maker chooses an action $a_t$ from
$A(s_t)$ for execution. This results in reward
$r_t = r(s_t, a_t)$ and observation $o_t \in O$ that is generated
according to the probabilities $p(o_t = o) = q(o|s_t, a_t)$.
Next, the POMDP process changes state. The successor state
$s_{t+1} \in S$ is selected according to the probabilities
$p(s_{t+1} = s) = p(s|s_t, a_t)$. This process repeats forever.

As an example, Figure 1 shows the POMDP that corresponds
to our sensor-planning task. The robot starts at a
doorway without knowing whether the room is occupied.
Thus, it is in state ROOM OCCUPIED with probability
0.5 and ROOM EMPTY with probability 0.5 but does not
know which one it is in. In both states, the robot can
OBSERVE, PROCEED to the next doorway, or ENTER ROOM
AND PROCEED. OBSERVE does not change the state but
the sensor observation provides information about it.
PROCEED and ENTER ROOM AND PROCEED both result
in the robot being at the next doorway and thus again in
state ROOM OCCUPIED with probability 0.5 and ROOM
EMPTY with probability 0.5. The observation probabilities
and rewards of the actions are as described above.


Policy  Graphs 

Assume that a decision maker has to determine which action
to execute for a given POMDP at time $t$. The decision
maker knows the specification of the POMDP, executed the
actions $a_1 \ldots a_{t-1}$, and made the observations
$o_1 \ldots o_{t-1}$. The objective of the decision maker is to
maximize the average total reward over an infinite planning
horizon, which is
$E\left(\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right)$, where
$\gamma \in (0, 1]$ is a discount factor.
The discount factor specifies the relative value of a reward
received after $t$ action executions compared to the same
reward received one action execution earlier. One often uses
a discount factor slightly smaller than one because this
ensures that the average total reward is finite, no matter which
actions are chosen. (We use $\gamma = 0.99$.) In our case, the
robot is the decision maker that executes movement and sensing
actions and receives information about the state of the world
from inaccurate sensors, such as the microphone. We let
the robot maximize the average total reward over an infinite
horizon because it searches a large number of rooms.

Figure 2: Policy Graph
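A short worked bound shows why a discount factor below one keeps the total reward finite. It assumes only that the rewards are bounded in magnitude by some constant; the symbol $r_{\max}$ is introduced here for illustration and does not appear in the original. For the sensor-planning task, the largest reward magnitude stated in the example is the penalty of 500.

```latex
\left| \sum_{t=1}^{\infty} \gamma^{t-1} r_t \right|
  \le \sum_{t=1}^{\infty} \gamma^{t-1} r_{\max}
  = \frac{r_{\max}}{1-\gamma},
\qquad
\gamma = 0.99,\; r_{\max} = 500
\;\Rightarrow\;
\frac{r_{\max}}{1-\gamma} = \frac{500}{0.01} = 50{,}000 .
```

With $\gamma = 1$ the geometric series diverges, so no such bound exists in general.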

It is a fundamental result of operations research that
optimal behaviors for the robot can be expressed either as value
surfaces or policy graphs (Sondik 1978). Value surfaces are
mappings from probability distributions over the states to
values. The robot calculates the expected value of the
probability distribution over the states that results from the
execution of each action and then chooses the action that results
in the largest expected value. Policy graphs are graphs where
the vertices correspond to actions and the directed edges
correspond to observations. The robot executes the action that
corresponds to its current vertex. Then, it makes an
observation, follows the corresponding edge, and repeats the
process.

It is far more common for POMDP planners to use value
surfaces than policy graphs. However, policy graphs allow
us to integrate POMDP planning and behavior-based systems
because of their similarity to finite state automata. As
an example, Figure 2 shows the optimal policy graph for our
sensor-planning task. This policy graph specifies a behavior
where the robot senses three times before it decides to
enter a room. If any of the sensing operations indicates that
the room is occupied, the robot decides to move to the next
doorway without entering the room.
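The behavior described above can be written down as a small policy graph and executed. The encoding below is reconstructed from the prose only, not from the figure, so the vertex names and the edges that leave the ENTER ROOM AND PROCEED and PROCEED vertices (both assumed to return to the first OBSERVE vertex) are assumptions.

```python
# vertex -> (action, {observation: next vertex}); vertex names are hypothetical.
POLICY_GRAPH = {
    "v1": ("OBSERVE", {"OBSERVE EMPTY": "v2", "OBSERVE OCCUPIED": "skip"}),
    "v2": ("OBSERVE", {"OBSERVE EMPTY": "v3", "OBSERVE OCCUPIED": "skip"}),
    "v3": ("OBSERVE", {"OBSERVE EMPTY": "enter", "OBSERVE OCCUPIED": "skip"}),
    # After moving on, the observation carries no information, so both
    # edges are assumed to lead back to the first OBSERVE vertex.
    "enter": ("ENTER ROOM AND PROCEED",
              {"OBSERVE EMPTY": "v1", "OBSERVE OCCUPIED": "v1"}),
    "skip": ("PROCEED", {"OBSERVE EMPTY": "v1", "OBSERVE OCCUPIED": "v1"}),
}


def run_policy(graph, start, observations):
    """Execute the action of the current vertex, then follow the edge
    labeled with the observation that was made; repeat."""
    vertex, actions = start, []
    for observation in observations:
        action, edges = graph[vertex]
        actions.append(action)
        vertex = edges[observation]
    return actions, vertex


if __name__ == "__main__":
    print(run_policy(POLICY_GRAPH, "v1", ["OBSERVE EMPTY"] * 4))
```

Three OBSERVE EMPTY reports in a row lead to ENTER ROOM AND PROCEED, whereas a single OBSERVE OCCUPIED report at any point leads to PROCEED.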

Optimal policy graphs can potentially be large but often
turn out to be very small (Cassandra, Kaelbling, & Littman
1994). However, finding optimal or close-to-optimal policy
graphs is PSPACE-complete in general (Papadimitriou
& Tsitsiklis 1987) and thus only feasible for small planning
tasks. We decided to use a POMDP planner that was developed
by Hansen in his dissertation at the University of
Massachusetts at Amherst (Hansen 1998). This POMDP planner
can often find optimal or close-to-optimal policy graphs for
our POMDP problems in seconds. We use the POMDP planner
unchanged, with one small exception. We noticed that
many vertices of the policy graph are often unreachable from
the start vertex and eliminate these vertices using a simple
graph-search technique.
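The paper does not say which graph-search technique is used; a breadth-first search from the start vertex is one simple possibility. The sketch below assumes the policy graph is stored as a dictionary that maps each vertex to its action and its observation-labeled edges, which is a representation chosen here for illustration.

```python
from collections import deque


def prune_unreachable(graph, start):
    """Return a copy of the policy graph that keeps only the vertices
    reachable from the start vertex (breadth-first search)."""
    reachable = {start}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        _action, edges = graph[vertex]
        for successor in edges.values():
            if successor not in reachable:
                reachable.add(successor)
                queue.append(successor)
    return {v: graph[v] for v in graph if v in reachable}


# Hypothetical planner output with one vertex ("orphan") that no edge
# reachable from "v1" points to.
RAW_GRAPH = {
    "v1": ("OBSERVE", {"o1": "v2", "o2": "v3"}),
    "v2": ("ENTER ROOM AND PROCEED", {"o1": "v1", "o2": "v1"}),
    "v3": ("PROCEED", {"o1": "v1", "o2": "v1"}),
    "orphan": ("OBSERVE", {"o1": "v1", "o2": "v2"}),
}

if __name__ == "__main__":
    print(sorted(prune_unreachable(RAW_GRAPH, "v1")))
```

The search visits each vertex and edge at most once, so pruning is cheap compared with planning.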

The  Robot  Architecture 

Figure 3 shows a flow graph of our robot architecture. The
user inputs a POMDP that models the planning task. The
robot architecture then uses the POMDP planner to produce
a policy graph and removes all vertices that are unreachable
from the initial vertex. By mapping the actions of the policy
graph to behaviors and the observations to triggers, the policy
graph is then transformed to a finite state automaton and
used to control the operation of MissionLab. The user still
has to input information but now only the planning task and
its parameters (that is, the probabilities and costs) and no
longer the plans. Once the finite state automaton is read into
MissionLab, we allow the user to examine and edit it, for
example, to add additional parts to the mission or make it part
of a larger finite state automaton. Figure 4 shows a screenshot
of the policy graph from Figure 2 after it was read into
MissionLab and augmented with details about how to implement
PROCEED (namely by marking the current doorway
and proceeding along the hallway until the robot is at an
unmarked doorway) and ENTER ROOM AND PROCEED
(namely by entering the room, leaving the room, marking
the current doorway, and proceeding along the hallway until
the robot is at an unmarked doorway). Furthermore, the user
decided that it was more robust to start with the behavior that
proceeds along the hallway because then the robot can start
anywhere in the hallway and not only at doorways.
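The transformation step can be sketched as a relabeling of the policy graph. This is not MissionLab's file format or API: the output dictionary layout and the behavior and trigger names in the usage example are hypothetical, chosen only to show that actions become behaviors and observations become triggers while the graph structure is kept.

```python
def policy_graph_to_fsa(graph, start, behavior_of, trigger_of):
    """Relabel a policy graph as a finite state automaton.

    graph:       vertex -> (action, {observation: next vertex})
    behavior_of: action -> behavior name
    trigger_of:  observation -> trigger name
    """
    states = {v: behavior_of[action] for v, (action, _edges) in graph.items()}
    arcs = {
        (v, trigger_of[observation]): successor
        for v, (_action, edges) in graph.items()
        for observation, successor in edges.items()
    }
    return {"start": start, "states": states, "arcs": arcs}


# Hypothetical names for illustration only.
BEHAVIORS = {"OBSERVE": "ListenAtDoorway", "PROCEED": "GoToNextDoorway"}
TRIGGERS = {"OBSERVE EMPTY": "HeardNothing", "OBSERVE OCCUPIED": "HeardNoise"}
TINY_GRAPH = {
    "v1": ("OBSERVE", {"OBSERVE EMPTY": "v1", "OBSERVE OCCUPIED": "v2"}),
    "v2": ("PROCEED", {"OBSERVE EMPTY": "v1", "OBSERVE OCCUPIED": "v1"}),
}

if __name__ == "__main__":
    print(policy_graph_to_fsa(TINY_GRAPH, "v1", BEHAVIORS, TRIGGERS))
```

Because the result is an ordinary automaton, a user can still edit it afterwards, for example by expanding a single behavior into a sequence of lower-level behaviors.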

Our robot architecture shows that it is possible to integrate
POMDP planning and behavior-based systems by specifying
the solution of POMDPs in the form of policy graphs.
However, there are small semantic differences between policy
graphs and finite state automata. POMDPs assume that
actions are discrete and that the robot makes an observation
after each action execution. Finite state automata assume
that behaviors are continuous and triggers can be observed
at any time during the execution of the behaviors. Two issues
need to be addressed in this context. First, we need
MissionLab to be able to deal with actions of finite duration. We
deal with this problem by adding an ACTION FINISHED
trigger to MissionLab. (This extension is not needed for
our sensor-planning task.) Second, we need to deal with a
potential combinatorial explosion of the number of
observations. Most POMDP planners assume that every observation
can be made in every state. Consequently, every vertex in a
policy graph has one outgoing edge for each possible
observation. However, the observations are n-tuples if there are
n sensors and the number of observations can thus be large.
This is not a problem for finite state automata since
observations that do not cause state transitions do not appear in
them. We deal with this problem by omitting subtasks from
the POMDP planning task that can be abstracted away or
are pre-sequenced and do not need to be planned. For
example, ENTER ROOM AND PROCEED is a macro-behavior
that consists of a sequence of observations and behaviors,
as shown in Figure 4. By omitting the details of ENTER
ROOM AND PROCEED, the observations IN ROOM, IN
HALLWAY, and MARKED DOORWAY do not need to be
considered during planning.

Experiments 

We test the performance of our system, both analytically and
experimentally, by comparing the average total reward of its
plans (that is, optimal plans) against the plans typically
generated by users. For our sensor-planning task, users typically
create plans that sense only once, no matter what the
probabilities and costs are. The robot executes ENTER ROOM
AND PROCEED if it senses ROOM EMPTY; otherwise it
executes PROCEED. We therefore use this plan as the baseline
plan and compare the plans generated by our system against
it.

Analytical Results: To demonstrate that our system has
an advantage over the previous system because it is able
to optimize its plans during missions when it is able to
estimate the costs more precisely, we determine analytically
how the average total reward of the baseline plan
depends on the reward $x = r(s_2, a_1)$ for entering an
occupied room. Let $k$ be the state directly before the robot
executes OBSERVE, $l$ the state directly before it executes
ENTER ROOM AND PROCEED, and $m$ the state directly
before it executes PROCEED. If the robot is in $k$, then it
incurs a cost of 5 for executing OBSERVE. It then transitions
from $k$ to $l$ with the probability with which the sensor
reports OBSERVE EMPTY; otherwise it transitions to $m$.

Figure 5: Average Total Rewards vs. Reward for Entering
an Occupied Room (x) (Analytical)

Using the notation of Figure 1, the probability $p(o_t = o_1)$
with which the sensor reports OBSERVE EMPTY is
$p(o_t = o_1) = q(o_1|s_1,a_2)\,p(s_t = s_1) + q(o_1|s_2,a_2)\,p(s_t = s_2)
= 1.0 \cdot 0.5 + 0.2 \cdot 0.5 = 3/5$. Consequently, the average total
reward $v(k)$ of the baseline plan if the robot starts in $k$ is
$v(k) = -5 + \gamma\,(3/5\,v(l) + 2/5\,v(m))$. Similar derivations
result in a system of three linear equations in three unknowns:

$$v(k) = -5 + \gamma \left( \tfrac{3}{5}\, v(l) + \tfrac{2}{5}\, v(m) \right)$$

$$v(l) = \tfrac{1}{6}\, x + \tfrac{5}{6} \cdot 100 + \gamma\, v(k)$$

$$v(m) = -50 + \gamma\, v(k)$$

Solving this system of equations yields $v(k) = 1241.21 +
4.97x$. Figure 5 shows this graph together with the average
total reward of the plans generated by our system, as a
function of $x$. As can be seen, the number of times a robot
has to sense OBSERVE EMPTY before it enters a room
increases as it becomes more expensive to enter an occupied
room. (The markers show when a change in plan occurs.)
The robot pays a cost for the additional sensing operations
but this decreases the probability of entering an occupied
room. Changing the plans as $x$ changes allows the average
total reward of the plans generated by our system to
deteriorate much more slowly than the average total reward of the
baseline plan. This result shows that our system has an
advantage over the previous system because it is able to adapt
plans during missions when the costs can be estimated more
precisely. It also shows that our system has an advantage
over the previous system because humans are not good at
planning with uncertainty and thus their plans are rarely
optimal. For example, the original sensor-planning problem
has $x = -500$ and the average total reward of the baseline
plan is only -1,246.24 whereas the average total reward of
the plan generated by our system is 374.98.
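The three linear equations of the baseline plan can be checked numerically. The sketch below writes them as a matrix equation in the unknowns (v(k), v(l), v(m)) and solves it with a small Gauss-Jordan routine; the helper names are illustrative, and the routine stands in for any linear solver.

```python
def solve_linear(a, b):
    """Solve a * v = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def baseline_values(x, gamma=0.99):
    """Return [v(k), v(l), v(m)] for the sense-once baseline plan."""
    a = [
        [1.0, -0.6 * gamma, -0.4 * gamma],  # v(k) - g(3/5 v(l) + 2/5 v(m)) = -5
        [-gamma, 1.0, 0.0],                 # v(l) - g v(k) = x/6 + 5/6 * 100
        [-gamma, 0.0, 1.0],                 # v(m) - g v(k) = -50
    ]
    b = [-5.0, x / 6.0 + 500.0 / 6.0, -50.0]
    return solve_linear(a, b)


if __name__ == "__main__":
    for x in (0.0, -200.0, -500.0):
        print(x, round(baseline_values(x)[0], 2))
```

Substituting the last two equations into the first gives $v(k)(1 - \gamma^2) = 24.7 + 0.099x$ for $\gamma = 0.99$, which matches the intercept and slope reported above after rounding.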

Similar results can be observed if the initial probabilities
are inaccurate. Figure 6 shows the average total reward of
the baseline plan together with the average total reward of
the plans generated by our system, as a function of the
probability $y = q(o_2|s_2,a_2)$ with which the microphone
correctly classifies an occupied room, for both $x = -200$ and
$x = -500$. As can be seen, the number of times a robot
has to sense OBSERVE EMPTY before it enters a room
increases as the sensor becomes more noisy. This again allows
the average total reward of the plans generated by our
system to deteriorate much more slowly than the average total
reward of the baseline plan, demonstrating the advantages of
our system.

Figure 6: Average Total Rewards vs. Probability of
Correctly Classifying an Occupied Room (Analytical)

Figure 7: Average Total Rewards (Experimental)
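The dependence on $x$ and $y$ can be explored with a small evaluator for one family of plans: observe up to n times, PROCEED on the first OBSERVE OCCUPIED report, and ENTER ROOM AND PROCEED only after n consecutive OBSERVE EMPTY reports. This is not the POMDP planner used in the paper, and restricting attention to this family is an assumption; n = 1 is the baseline plan, and n = 3 is the behavior described for Figure 2.

```python
def plan_value(n, x, y=0.8, gamma=0.99, prior=0.5,
               c_observe=5.0, c_proceed=50.0, r_empty=100.0):
    """Average total discounted reward of the 'n EMPTY reports, then
    enter' plan, started at a doorway."""
    p_occ = prior        # joint prob.: occupied and every report so far EMPTY
    p_emp = 1.0 - prior  # joint prob.: empty (never reported as occupied)
    const, coef, disc = 0.0, 0.0, 1.0  # value satisfies V = const + coef * V
    for _ in range(n):
        const -= disc * (p_occ + p_emp) * c_observe  # pay for this OBSERVE
        disc *= gamma
        heard = p_occ * y                     # OBSERVE OCCUPIED, so PROCEED
        const -= disc * heard * c_proceed
        coef += disc * gamma * heard          # restart at the next doorway
        p_occ *= 1.0 - y                      # occupied room was missed again
    const += disc * (p_occ * x + p_emp * r_empty)  # ENTER ROOM AND PROCEED
    coef += disc * gamma * (p_occ + p_emp)         # restart at the next doorway
    return const / (1.0 - coef)


def best_n(x, y=0.8, max_n=8):
    """Number of observations that maximizes plan_value within the family."""
    return max(range(1, max_n + 1), key=lambda n: plan_value(n, x, y))


if __name__ == "__main__":
    for x in (-200.0, -500.0):
        print(x, best_n(x), round(plan_value(best_n(x), x), 2))
```

With $y = 0.8$, the evaluator reproduces the baseline value for n = 1 and, for $x = -500$, prefers n = 3 with a value of about 374.99, in line with the figures quoted in the text up to rounding.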

Experimental Results: We also performed a simulation
study with MissionLab to compare the average total reward
of the plans generated by our system against the baseline
plan, for $y = 0.8$ and both $x = -200$ and $x = -500$. We
used four rooms and averaged over ten runs. Figure 7 shows
the results. In both cases, the average total reward of the
baseline plan is much smaller than the average total reward
of the plans generated by our system. (The table shows that
the average total reward of the plans generated by our system
actually increased as it became more costly to enter an
occupied room. This artifact is due to the reduced probability
of entering an occupied room, causing the situation to never
occur during our limited number of runs.) These results are
similar to the analytical results shown in Figure 5.

Conclusion 

This paper reported on initial work that uses Partially
Observable Markov Decision Process models (POMDPs) in the
context of behavior-based systems. The insight to making
this combination work is that POMDP planners can generate
policy graphs rather than the more popular value surfaces,
and policy graphs are similar to the finite state automata
that behavior-based systems use to sequence their behaviors.
This combination also keeps the POMDPs small, which
allows our POMDP planners to find optimal or
close-to-optimal plans whereas the POMDP planners of other robot
architectures can only find very suboptimal plans.

We used this insight to improve MissionLab, a
behavior-based system where the finite state automata had to be
programmed by the users of the system at the beginning of the
mission. This had the disadvantage that humans are not good
at planning with uncertainty and thus their plans are rarely
optimal. In contrast, our robot architecture not only
produces close-to-optimal plans but is also able to optimize
the finite state automata when it learns more accurate
probabilities or when the environment changes.

It is future work to study interfaces that allow users to
easily input POMDPs, including probabilities and costs. Also,
we intend to implement sampling methods for adapting the
probabilities and costs of POMDPs during missions to be
able to update the plan during execution. Finally, it is
future work to scale up our robot architecture by developing
POMDP planners that are able to take advantage of the
structure of the POMDP planning tasks and thus are more
efficient than current POMDP planners.